There are roughly 45 thousand compounds in the DTP NCI NIH Cell Line data. Of those, we were able to find just shy of 42K unique canonicalized SMILES. There are 159 cell lines, 60 of which have gene expression data as well. Most of the compounds were screened against those 60 cell lines (somewhere between 60% and 98% of them), and the number of compounds screened against the remaining 99 cell lines falls off fairly quickly.
We took the Growth Inhibition data (GI50) and created a database of just those entries for which we have canonicalized SMILES. We then chose the cell line with the most screened compounds (nearly 42K), “A549/ATCC”, and identified the compounds with a -log GI50 (NLGI50) greater than 5.0. The Wang et al. paper (JCIM, web 10/4/07) suggests that there is biological evidence for 4.0 as a cutoff of biological activity, though they do not cite a source for this number. They then chose greater than or equal to 5.0 as their activity cutoff, for reasons having to do with the activity distribution. We chose strictly greater than 5.0 as our cutoff to bias the set towards the more active compounds, which gave 5550 active compounds for this cell line. We can assume that most of these are truly active: the average NLGI50 standard deviation (STD) for compounds screened more than once was small, and the median was even smaller, suggesting that it is very rare for a compound to have an STD as large as 0.5 NLGI50 units. Thus, compounds that screened above 5 were almost never screened again with values of 4 or below. Compounds at 5 or below are most likely inactive or only weakly active.
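The cutoff and replicate-spread reasoning above can be sketched in a few lines of Python. This is only an illustration, not the R code used for this work: the compound IDs and values are made up, the function names are hypothetical, and averaging replicates before applying the cutoff is an assumption.

```python
import math
from statistics import mean, median, stdev

ACTIVE_CUTOFF = 5.0  # strictly greater than this NLGI50 counts as active


def nlgi50(gi50_molar):
    """Negative log10 of a GI50 value given in molar units."""
    return -math.log10(gi50_molar)


def label_compounds(replicates):
    """Map compound ID -> (mean NLGI50, is_active).

    Averaging replicates before applying the cutoff is an assumption.
    """
    labels = {}
    for cid, values in replicates.items():
        avg = mean(values)
        labels[cid] = (avg, avg > ACTIVE_CUTOFF)
    return labels


def replicate_spread(replicates):
    """Mean and median of per-compound STDs, over compounds screened more than once."""
    stds = [stdev(values) for values in replicates.values() if len(values) > 1]
    return mean(stds), median(stds)


# Hypothetical NLGI50 replicates, for illustration only.
example = {
    "cpd_a": [5.6, 5.8],
    "cpd_b": [4.0],
    "cpd_c": [5.0, 5.0],
    "cpd_d": [4.3, 4.5, 4.4],
}
labels = label_compounds(example)
actives = sorted(cid for cid, (_, is_active) in labels.items() if is_active)
mean_std, median_std = replicate_spread(example)
print(actives, round(mean_std, 3), round(median_std, 3))
```

Note that a compound sitting exactly at 5.0 falls on the inactive side here, which is the difference between this cutoff and the greater-than-or-equal cutoff in the Wang et al. paper.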
We selected the active set sample of 155 compounds by clustering all of the actives (the 5550 active set), then selecting the largest cluster at a 0.7 Tanimoto similarity threshold. The fingerprints used were the Mesa 768 keys. "Mesa" is somewhat of a misnomer, as they are mostly a selection of the MACCS keys -- a subset of the union of the 166, 320, and 966, plus a handful from the PubChem keys that extend the 966. We used the bit representation of this key-based fingerprint. We also clustered the count-based version; Norah (MacCuish) compared the compounds in the bit and count versions and selected the bit-based cluster. The clustering was done with Taylor-Butina clustering (exclusion region clustering). Though the compounds are somewhat alike, they are only very loosely a “series”. From a quick glance at the SMILES and depictions, one gets the feeling that there are several series, somewhat loosely related. A 0.7 Tanimoto threshold is a fairly broad similarity threshold.
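For readers unfamiliar with exclusion region clustering, here is a minimal pure-Python sketch of one common Taylor-Butina variant. It is not the Mesa Clustering program: fingerprints are represented as sets of on-bit indices, the toy fingerprints and IDs are made up, and re-counting unassigned neighbors at each step is a choice of this sketch (implementations differ on that point).

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    common = len(a & b)
    union = len(a) + len(b) - common
    return common / union if union else 0.0


def taylor_butina(fps, threshold=0.7):
    """Return a list of (representative_id, member_ids) in the order picked."""
    ids = sorted(fps)
    # Neighbor list of each compound: everything at or above the threshold.
    neighbors = {
        i: {j for j in ids if tanimoto(fps[i], fps[j]) >= threshold} for i in ids
    }
    unassigned = set(ids)
    clusters = []
    while unassigned:
        # Pick the compound with the most unassigned neighbors (ties: lowest ID).
        rep = max(sorted(unassigned), key=lambda i: len(neighbors[i] & unassigned))
        members = (neighbors[rep] & unassigned) | {rep}
        clusters.append((rep, sorted(members)))
        unassigned -= members  # the exclusion region is removed from play
    return clusters


# Toy bit fingerprints (sets of on-bit indices), for illustration only.
fps = {
    "cpd_a": set(range(1, 11)),
    "cpd_b": set(range(1, 10)) | {11},
    "cpd_c": set(range(1, 10)) | {12},
    "cpd_d": {20, 21, 22, 23},
    "cpd_e": {20, 21, 22, 23, 24},
    "cpd_f": {30, 31},
}
clusters = taylor_butina(fps, threshold=0.7)
largest_rep, largest_members = max(clusters, key=lambda c: len(c[1]))
print(largest_rep, largest_members)
```

The representative of each cluster is the compound whose exclusion region defined it, which is the sense in which the text speaks of a cluster "representative" or "centrotype".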
The inactive set of 150 was created by taking the representative compound of the active set (the "centroid", or "centrotype" to use the obscure statistical term) and performing a similarity search, again using the bit-based fingerprint, against the inactive set: the 35133 compounds with NLGI50 <= 5 for this cell line. We selected the top 150 compounds, which corresponds to a Tanimoto threshold of just above 0.67.
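The similarity search step can be sketched as a simple ranking: score every inactive against the active-set representative and keep the top N. The sketch below is an assumption-laden illustration, not the Measures or SimilarityOutput programs; the fingerprints are toy sets of on-bit indices and the IDs are made up.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    common = len(a & b)
    union = len(a) + len(b) - common
    return common / union if union else 0.0


def top_hits(query_fp, library, n=150):
    """Rank library compounds by similarity to the query and keep the top n.

    Returns a list of (similarity, compound_id), most similar first;
    ties are broken by compound ID so the result is deterministic.
    """
    scored = [(tanimoto(query_fp, fp), cid) for cid, fp in library.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:n]


# Toy data: the query stands in for the active-set representative.
query = {1, 2, 3, 4}
inactives = {
    "cpd_w": {7, 8},
    "cpd_x": {1, 2, 3, 4, 5},
    "cpd_y": {1, 2, 3},
    "cpd_z": {1, 9},
}
hits = top_hits(query, inactives, n=2)
effective_threshold = hits[-1][0]  # similarity of the last compound kept
print(hits, effective_threshold)
```

Taking a fixed number of hits and reading off the similarity of the last one is how a count of 150 and a threshold of just above 0.67 can describe the same selection.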
The inactives mostly have NLGI50 values of 4, or between 4 and 5. Having this spread of activity is good from the standpoint of regression. Classification may be a bit more dodgy, and we may need to see whether there is a natural break between the two sets for this cell line that differs from the greater-than-5 NLGI50 cutoff.
We also created a diverse inactive subset of 150 compounds comprising a very broad group of compounds selected from the inactives. We clustered the 35133 inactives and selected the representatives of the top 150 clusters at a Tanimoto similarity threshold of 0.7. This is a biased "diverse" set, in that its members are more likely to come from individual series within the data set than from very small clusters or singletons, of which there are many (~5 thousand) at the 0.7 threshold.
We should at least be able to build a classifier with 2D or 3D properties that distinguishes this diverse set from the actives. The harder test will be how good a model we can build using the inactives that are more like the actives -- especially in the case of a regression model.
We also generated selectivity information for all of the active and inactive compounds, namely how often each was active in any of the other cell lines (out of the 159 in total). I have not yet included this information in either sample set, but we can include it if it is deemed of interest. Many of the compounds appear to be fairly promiscuous.
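One way to picture the selectivity information is as a per-compound count of how many other cell lines it was active in, out of how many it was tested in. The Python sketch below is a hypothetical illustration, not the R code used for this work: the data layout, the placeholder cell line names LINE_B and LINE_C, the values, and reusing the greater-than-5 cutoff for the other cell lines are all assumptions.

```python
def selectivity(screen, reference_line, cutoff=5.0):
    """Map compound ID -> (n_active_elsewhere, n_tested_elsewhere).

    `screen` maps cell line name -> {compound ID: NLGI50}. Compounds that
    were only tested in the reference line do not appear in the result.
    """
    counts = {}
    for line, results in screen.items():
        if line == reference_line:
            continue  # only count activity in the *other* cell lines
        for cid, value in results.items():
            active, tested = counts.get(cid, (0, 0))
            counts[cid] = (active + int(value > cutoff), tested + 1)
    return counts


# Hypothetical data; LINE_B and LINE_C are placeholder cell line names.
screen = {
    "A549/ATCC": {"cpd_1": 6.1, "cpd_2": 4.0},
    "LINE_B": {"cpd_1": 5.5, "cpd_2": 4.2},
    "LINE_C": {"cpd_1": 5.2},
}
counts = selectivity(screen, reference_line="A549/ATCC")
print(counts)
```

A compound with a high active-to-tested ratio across many cell lines would be one of the "promiscuous" compounds mentioned above.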
The selectivity calculation took a fair amount of CPU time in R, about 6.5 hours. The remaining work (generating fingerprints, clustering, similarity searching, etc.) took very little CPU time, plus some time fiddling with command-line scripts, getting the data in and out of R, and so on.
Programs Used
babel2 to convert NCI data sd file to SMILES file.
Canonicalize (a Mesa program built with OEChem) to convert the smi to canonicalized smi and to carry along any ancillary data columns.
Shell scripting to obtain just the unique canonicalized smiles with their respective data and IDs
gen_mesa768 (the Mesa fingerprinter using OEChem), used in two modes, bit and count, to convert the smi to fps or cnts.
Measures (a Mesa Grouping Module program, the sequential version of the Measures program) for either similarity searching (MXN) or cluster matrix generation (NXN or sparse NXN), using fps or cnts.
Clustering (a Mesa Grouping Module program, the sequential version of the clustering program) to cluster the matrices produced by the Measures program.
ClusterOutput (a Mesa Grouping Module program that converts the clustering output to various forms and formats, e.g., returning just the representative compounds) to merge cluster output with the data file (smi, ID, activity, ...) and return clusters, cluster representatives, clustering statistics, etc.
SimilarityOutput (a Mesa Grouping Module program) to convert Measures similarity output to a rank-ordered format along with the data files used (smi, ID, activity, ...).
R to select the actives and inactives from the cell line with the most compounds screened, and to extract the selectivity information.
ChemTattoo (a Mesa program using OEChem, with the ChemAxon Marvin tools as a depictor) to look at the active and inactive results from the programs above and check for substructure commonality.
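The "unique canonicalized SMILES" step in the list above was done with shell scripting; a rough Python equivalent is sketched below for readers who prefer it. The tab-separated "SMILES, ID, data" layout, the IDs, and the choice to keep the first record seen for each SMILES are assumptions of this sketch, not a description of the actual scripts.

```python
def unique_by_smiles(lines):
    """Keep the first record seen for each canonical SMILES string.

    Each line is assumed to look like 'SMILES<tab>ID<tab>data...' and to be
    canonicalized already, so identical strings mean identical structures.
    """
    first_seen = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        smiles = line.split("\t", 1)[0]
        first_seen.setdefault(smiles, line)
    # Dicts preserve insertion order, so input order is retained.
    return list(first_seen.values())


# Made-up records for illustration only.
records = [
    "CCO\tID_1\t5.2",
    "c1ccccc1\tID_2\t4.0",
    "CCO\tID_3\t5.4",
]
unique_records = unique_by_smiles(records)
print(unique_records)
```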
Sample Data
(Wang et al. paper 2007)